
Performance: build DocIdSetIterator in ArrayHitCounter to enable future optimizations #718

Merged
alexklibisz merged 15 commits into main from 160-better-counter on Aug 28, 2024

Conversation

@alexklibisz (Owner) commented on Aug 25, 2024

Related Issue

#715

Changes

I have some ideas for improvements to the ArrayHitCounter. Most of them are tightly coupled to the way the DocIdSetIterator is constructed, so I'd like to bring that functionality into the HitCounter. That's what this PR does.
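A rough sketch of the resulting interface shape (method names are illustrative, not the final API):

```java
import org.apache.lucene.search.DocIdSetIterator;

// Hedged sketch: the counter itself builds the DocIdSetIterator, rather
// than callers reading raw counts out of it. This lets an implementation
// like ArrayHitCounter specialize the construction later.
interface HitCounter {
  void increment(int docId);
  DocIdSetIterator docIdSetIterator(int minCount);
}
```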

Testing and Validation

Added new tests and re-ran benchmarks

@alexklibisz changed the title from "Performance: add DocIdSetIterator to ArrayHitCounter" to "Performance: build DocIdSetIterator in ArrayHitCounter" on Aug 26, 2024
@joancf commented on Aug 26, 2024

@alexklibisz I was going to open a PR to solve the issue the old HitCounter had of returning documents with zero hits. I see that the error is still present. The simplest way to solve it is to skip documents whose count is zero when building the iterator; see the sketch below.
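A minimal sketch of that guard, assuming a counts array indexed by docId (hypothetical names, not the actual elastiknn code):

```java
import org.apache.lucene.search.DocIdSetIterator;

// Hypothetical sketch: an iterator that only returns docs with a non-zero
// hit count. `counts` stands in for ArrayHitCounter's internal array.
final class NonZeroHitsIterator {
  static DocIdSetIterator of(short[] counts) {
    return new DocIdSetIterator() {
      private int doc = -1;

      @Override public int docID() { return doc; }

      @Override public int nextDoc() {
        doc++;
        // Skip documents with zero hits so they are never returned.
        while (doc < counts.length && counts[doc] == 0) doc++;
        if (doc >= counts.length) doc = NO_MORE_DOCS;
        return doc;
      }

      @Override public int advance(int target) {
        doc = target - 1; // nextDoc() increments and then skips zero counts
        return nextDoc();
      }

      @Override public long cost() { return counts.length; }
    };
  }
}
```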

I also analyzed your code a bit. Let's compute the time if a shard has D documents, H of which are a hit in at least one of the L hash tables. The current version has a complexity of O(H) to fill the table and O(D) in time and space to find the top ones.

I also thought that using a hash map of docId -> frequency could be better, though the numbers are difficult to compare. With a hashmap, putting the data into it takes O(H). Then an array of lists (List[]) of size max_freq can be built to bucket the docs by frequency, which also takes O(H); see the sketch below. The time and space needed depend on the values of L and K: for example, K=10 selects about 0.1% of the documents per table, and with L=100 this implies H is less than 10% of the documents. So I think using a hashmap can be smaller in both space and time.
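A hedged sketch of that hashmap-plus-buckets idea (illustrative names, not a drop-in for elastiknn):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: count hits in a HashMap (O(H) space instead of O(D)), then
// bucket docIds by frequency so the top candidates can be read from the
// highest buckets without scanning all D documents.
final class HashMapHitCounterSketch {
  private final Map<Integer, Integer> counts = new HashMap<>();
  private int maxFreq = 0;

  void increment(int docId) {
    int c = counts.merge(docId, 1, Integer::sum);
    if (c > maxFreq) maxFreq = c;
  }

  // buckets[f] holds all docIds hit exactly f times; filling is O(H).
  List<Integer>[] bucketsByFrequency() {
    @SuppressWarnings("unchecked")
    List<Integer>[] buckets = new List[maxFreq + 1];
    for (Map.Entry<Integer, Integer> e : counts.entrySet()) {
      int f = e.getValue();
      if (buckets[f] == null) buckets[f] = new ArrayList<>();
      buckets[f].add(e.getKey());
    }
    return buckets;
  }
}
```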

@joancf commented on Aug 26, 2024

Let me add a further comment about L and K.

K=1 implies a division of the space into 2 parts by a random plane, so we should expect 50% of the documents on each side. If we use K=1 and L=m tables, then the probability of a document being a positive hit in at least one table is 1 - (probability of being a negative in all L tables), i.e. 1 - 0.5^m, which is already 99.9% for L >= 10. So with K=1 and L=99 we are considering almost all the documents.

With K=10, the probability of being a hit in a single hash table is 0.5^10, so the probability of being a negative in one table is (2^10 - 1)/2^10. In general, the probability of a document being a hit in at least one hash table is

1 - ((2^K - 1) / 2^K)^L

Here are the computed values of this expression for K from 1 to 10 and L from 10 to 100, in steps of 10:

| K \ L | 10 | 20 | 30 | 40 | 50 | 60 | 70 | 80 | 90 | 100 |
|-------|----|----|----|----|----|----|----|----|----|-----|
| 1 | 0.999023 | 0.999999 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 2 | 0.943686 | 0.996829 | 0.999821 | 0.99999 | 0.999999 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 3 | 0.736924 | 0.930791 | 0.981793 | 0.99521 | 0.99874 | 0.999669 | 0.999913 | 0.999977 | 0.999994 | 0.999998 |
| 4 | 0.47554 | 0.724941 | 0.855743 | 0.924343 | 0.960321 | 0.97919 | 0.989086 | 0.994276 | 0.996998 | 0.998426 |
| 5 | 0.272024 | 0.470051 | 0.61421 | 0.719154 | 0.795551 | 0.851166 | 0.891652 | 0.921125 | 0.942581 | 0.9582 |
| 6 | 0.145709 | 0.270187 | 0.376528 | 0.467373 | 0.544982 | 0.611282 | 0.667922 | 0.716309 | 0.757645 | 0.792958 |
| 7 | 0.075435 | 0.145179 | 0.209663 | 0.269282 | 0.324403 | 0.375367 | 0.422486 | 0.466051 | 0.506329 | 0.543569 |
| 8 | 0.038383 | 0.075293 | 0.110786 | 0.144916 | 0.177737 | 0.209298 | 0.239647 | 0.268832 | 0.296896 | 0.323884 |
| 9 | 0.01936 | 0.038346 | 0.056964 | 0.075222 | 0.093126 | 0.110684 | 0.127901 | 0.144785 | 0.161343 | 0.17758 |
| 10 | 0.009723 | 0.019351 | 0.028886 | 0.038328 | 0.047678 | 0.056937 | 0.066106 | 0.075186 | 0.084178 | 0.093083 |

These values show how the hit probability changes as K and L increase.
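For reference, a small snippet (hypothetical, just to reproduce the table) computing these values:

```java
// Probability that a document collides with the query in at least one of
// L hash tables with K bits each, assuming each bit matches independently
// with probability 1/2.
public class HitProbability {
  public static void main(String[] args) {
    for (int k = 1; k <= 10; k++) {
      for (int l = 10; l <= 100; l += 10) {
        double pTable = 1.0 / (1 << k);            // match one table: (1/2)^K
        double pHit = 1 - Math.pow(1 - pTable, l); // at least one of L tables
        System.out.printf("K=%d L=%d p=%.6f%n", k, l, pHit);
      }
    }
  }
}
```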

So K=10 and L=100 selects about 1/10 of the documents, but L=100 with K < 6 selects most of the documents. This impacts your benchmarks in #160 (comment): since you always use K=4 and L=100, roughly 99% of the documents are a hit!

@alexklibisz marked this pull request as ready for review on August 28, 2024.
@alexklibisz changed the title from "Performance: build DocIdSetIterator in ArrayHitCounter" to "Performance: build DocIdSetIterator in ArrayHitCounter to enable future optimizations" on Aug 28, 2024.
@alexklibisz merged commit c5a8e21 into main on Aug 28, 2024 (5 checks passed).
@alexklibisz deleted the 160-better-counter branch on August 28, 2024.